Model Selection¶

In which we choose the best model to predict the age of a crab.¶
GitHub Repository¶
Notebook Viewer¶
Kaggle Dataset¶

Table of Contents¶

  1. Define Constants
  2. Import Libraries
  3. Load Data from Cache
  4. Split the Data
  5. Metrics Used
  6. Model Exploration
    1. Naive Linear Regression
    2. Neural Network Model
    3. Neural Network Model (32-16-8-1)
    4. Neural Network Model (16-8-1)
    5. Neural Network Model (8-1)
    6. Neural Network Model (4-1)
    7. Neural Network Model (2-1)
    8. True vs Predicted Age Scatter Plots
    9. Training Loss Over Time Plots
    10. Re-Train the Models Again
    11. Re-Plot the Training Loss Over Time
      1. Training Loss Over More Time Observations
  7. Model Leaderboard
    1. My Criteria
    2. Putting it All Together
    3. Reminder of Our Metrics
    4. Model Type Comparison
      1. Score Comparison Observations
    5. Show the Leaderboard Again
      1. Score These Scores
  8. Choose the Best Architecture for the Job
  9. Hyperparameter Tuning
    1. Hyperparameters
    2. Optimizer Tuning
    3. Optimizer Decision
    4. Learning Rate Tuning
    5. Learning Rate Decision
    6. Loss Function to Mean Absolute Error
    7. Perhaps an Ensemble Will Help
  10. Winner, Winner, Crab's for Dinner!
  11. Onwards to Feature Engineering

Define Constants¶

In [1]:
%%time
CACHE_FILE = '../cache/splitcrabs.feather'
NEXT_NOTEBOOK = '../2-features/features.ipynb'
MODEL_CHECKPOINT_FILE = '../cache/best_model.weights.h5'

PREDICTION_TARGET = 'Age'    # 'Age' is predicted
DATASET_COLUMNS = ['Sex_F','Sex_M','Sex_I','Length','Diameter','Height','Weight','Shucked Weight','Viscera Weight','Shell Weight',PREDICTION_TARGET]
REQUIRED_COLUMNS = [PREDICTION_TARGET]

NUM_EPOCHS = 100
VALIDATION_SPLIT = 0.2
CPU times: total: 0 ns
Wall time: 0 ns

Import Libraries¶

In [2]:
%%time
from notebooks.time_for_crab.mlutils import display_df, generate_neural_pyramid
from notebooks.time_for_crab.mlutils import plot_training_loss, plot_training_loss_from_dict, plot_true_vs_pred_from_dict
from notebooks.time_for_crab.mlutils import score_combine, score_comparator, score_model

import keras

keras_backend = keras.backend.backend()
print(f'Keras version: {keras.__version__}')
print(f'Keras backend: {keras_backend}')
if keras_backend == 'tensorflow':
    import tensorflow as tf
    print(f'TensorFlow version: {tf.__version__}')
    print(f'TensorFlow devices: {tf.config.list_physical_devices()}')
elif keras_backend == 'torch':
    import torch
    print(f'Torch version: {torch.__version__}')
    print(f'Torch devices: {torch.cuda.get_device_name(torch.cuda.current_device())}')
    # torch supports windows-native cuda, but CPU was faster for this task
elif keras_backend == 'jax':
    import jax
    print(f'JAX version: {jax.__version__}')
    print(f'JAX devices: {jax.devices()}')
else:
    print('Unknown backend; proceed with caution.')

import numpy as np
import pandas as pd

from typing import Generator

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

pd.set_option('mode.copy_on_write', True)
Keras version: 3.3.3
Keras backend: tensorflow
TensorFlow version: 2.16.1
TensorFlow devices: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
CPU times: total: 375 ms
Wall time: 2.68 s

Load Data from Cache¶

In the exploratory data analysis section, we saved the cleaned and split data to a cache file. Let's load it back.

In [3]:
%%time
crabs = pd.read_feather(CACHE_FILE)
crabs_test = pd.read_feather(CACHE_FILE.replace('.feather', '_test.feather'))

display_df(crabs, show_distinct=True)

# split features from target
X_train = crabs.drop([PREDICTION_TARGET], axis=1)
y_train = crabs[PREDICTION_TARGET]

X_test = crabs_test.drop([PREDICTION_TARGET], axis=1)
y_test = crabs_test[PREDICTION_TARGET]

print(f'X_train: {X_train.shape}')
print(f'X_test: {X_test.shape}')
DataFrame shape: (3031, 11)
First 5 rows:
        Length  Diameter    Height    Weight  Shucked Weight  Viscera Weight  \
3483  1.724609  1.312500  0.500000  50.53125       25.984375        9.429688   
993   1.612305  1.312500  0.500000  41.09375       17.031250        7.273438   
1427  1.650391  1.262695  0.475098  40.78125       19.203125        8.078125   
3829  1.362305  1.150391  0.399902  25.43750        9.664062        4.691406   
1468  1.250000  0.924805  0.375000  30.09375       14.007812        6.320312   

      Shell Weight  Sex_F  Sex_I  Sex_M  Age  
3483     13.070312  False  False   True   12  
993      14.320312   True  False  False   13  
1427      5.046875  False  False   True   11  
3829      9.781250  False  False   True   10  
1468      8.390625  False  False   True    9  
<class 'pandas.core.frame.DataFrame'>
Index: 3031 entries, 3483 to 658
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Length          3031 non-null   float16
 1   Diameter        3031 non-null   float16
 2   Height          3031 non-null   float16
 3   Weight          3031 non-null   float16
 4   Shucked Weight  3031 non-null   float16
 5   Viscera Weight  3031 non-null   float16
 6   Shell Weight    3031 non-null   float16
 7   Sex_F           3031 non-null   bool   
 8   Sex_I           3031 non-null   bool   
 9   Sex_M           3031 non-null   bool   
 10  Age             3031 non-null   int8   
dtypes: bool(3), float16(7), int8(1)
memory usage: 77.0 KB
Info:
None
Length distinct values:
[1.725  1.612  1.65   1.362  1.25   1.6875 1.487  1.5625 1.4375 1.45  ]
Diameter distinct values:
[1.3125 1.263  1.15   0.925  1.2    1.162  0.8877 0.8374 1.388  1.0625]
Height distinct values:
[0.5    0.475  0.4    0.375  0.4624 0.425  0.4126 0.4375 0.2876 0.2625]
Weight distinct values:
[50.53 41.1  40.78 25.44 30.1  45.   32.03 32.38 30.19 29.34]
Shucked Weight distinct values:
[25.98  17.03  19.2    9.664 14.01  19.66  16.16  16.42  14.13  11.37 ]
Viscera Weight distinct values:
[9.43  7.273 8.08  4.69  6.32  9.52  7.242 6.082 5.29  2.623]
Shell Weight distinct values:
[13.07  14.32   5.047  9.78   8.39  11.195  7.51   8.22   7.98  10.914]
Sex_F distinct values:
[False  True]
Sex_I distinct values:
[False  True]
Sex_M distinct values:
[ True False]
Age distinct values:
[12 13 11 10  9  8 17  6 19  7]
X_train: (3031, 10)
X_test: (759, 10)
CPU times: total: 0 ns
Wall time: 16 ms

Metrics Used¶

Throughout this notebook, we will use the following metrics to evaluate the regression model:

Mean Squared Error¶

  • The best score is 0.0
  • Lower is better.
  • Larger errors are penalized more than smaller errors.

Mean Absolute Error¶

  • The best score is 0.0
  • Lower is better.
  • Less sensitive to outliers.

Explained Variance Score¶

  • The best score is 1.0
  • Higher is better.

R2 Score¶

  • The best score is 1.0
  • Higher is better.

From the scikit-learn documentation:

Note: The Explained Variance score is similar to the R^2 score, with the notable difference that it does not account for systematic offsets in the prediction. Most often the R^2 score should be preferred.
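The `score_model` helper used below lives in `mlutils` and isn't reproduced in this notebook. A minimal NumPy sketch of the four metrics (the name `score_predictions` is mine, not the mlutils API) also demonstrates the note above: a constant offset in the predictions hurts MSE, MAE, and R^2, but explained variance doesn't notice it.

```python
import numpy as np

def score_predictions(y_true, y_pred):
    """Compute the four regression metrics used throughout this notebook."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    ss_res = np.sum(residual ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return {
        'mean_squared_error': np.mean(residual ** 2),
        'mean_absolute_error': np.mean(np.abs(residual)),
        # Explained variance only looks at the *variance* of the residuals,
        # so a systematic offset in the predictions goes unpunished.
        'explained_variance_score': 1.0 - np.var(residual) / np.var(y_true),
        'r2_score': 1.0 - ss_res / ss_tot,
    }

# Predictions that are exactly 2 years too old for every crab:
y_true = np.array([9, 10, 11, 12])
scores = score_predictions(y_true, y_true + 2)
# MSE and MAE reflect the bias and R^2 goes negative,
# yet explained variance still reports a perfect 1.0.
```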

Model Exploration¶

So far we have done no feature engineering, which is often the most important part of the process. New features constructed from this dataset might call for a different model, but for now we can use all of the raw features to establish a baseline.

We will compare the following models:

  • Naive Random Baseline
  • Linear Regression
  • Neural Networks
    • (64-32-16-8-1)
    • (32-16-8-1)
    • (16-8-1)
    • (8-1)
    • (4-1)
    • (2-1)

Naive Linear Regression¶

The simplest model is a linear regression. Before training, its weights are just their random initial values, so its predictions serve as our naive random baseline.

In [4]:
%%time
# layer: input
layer_feature_input = keras.layers.Input(shape=(len(X_train.columns),))

# layer: normalizer
layer_feature_normalizer = keras.layers.Normalization(axis=-1)
layer_feature_normalizer.adapt(np.array(X_train))

# layer: output (linear regression)
layer_feature_output = keras.layers.Dense(units=1)

# architecture:
#   input -> normalizer -> linear
# initialize the all_models dictionary
all_models = {'linear': keras.Sequential([
    layer_feature_input,
    layer_feature_normalizer,
    layer_feature_output])}

all_models['linear'].summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense (Dense)                   │ (None, 1)              │            11 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 32 (132.00 B)
 Trainable params: 11 (44.00 B)
 Non-trainable params: 21 (88.00 B)
CPU times: total: 15.6 ms
Wall time: 50.1 ms

Configure the Linear Model¶

These will be used for all models unless otherwise specified.

  • Optimizer
    • Adam: Adaptive Moment Estimation (Kingma & Ba, 2014)
  • Loss Function
    • Mean Squared Error (MSE)
      • This penalizes larger errors more than smaller errors.
      • We removed outliers in the data cleaning step, so a few extreme crabs should not dominate the quadratic penalty.
  • Callbacks
    • Model Checkpoint
      • Save the best model weights.
Define Common Compile Options¶
Define Common Checkpoint Options¶
In [5]:
%%time
# some framework
def next_adam(learning_rate:float=0.001) -> Generator[keras.optimizers.Adam, None, None]:
    """Yield the next Adam optimizer with the given learning rate."""
    yield keras.optimizers.Adam(learning_rate=learning_rate)


def common_compile_options(
        optimizer:keras.Optimizer=None,
        loss_metric:str='mean_squared_error'):
    """Return a dictionary of common compile options.

    :param optimizer: The optimizer to use. Defaults to Adam with LR=0.001.
    :param loss_metric: The loss metric to use. Defaults to 'mean_squared_error'.
    """
    return {
        'optimizer': optimizer if optimizer is not None else next(next_adam()),
        'loss': loss_metric
    }


all_models['linear'].compile(**common_compile_options())

common_checkpoint_options = {
    'monitor': 'val_loss',
    'save_best_only': True,
    'save_weights_only': True,
    'mode': 'min'
}

linear_checkpoint = keras.callbacks.ModelCheckpoint(
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_linear.weights.h5'),
    **common_checkpoint_options)
CPU times: total: 0 ns
Wall time: 6 ms

Score the Linear Model (Before Training)¶

In [6]:
%%time
untrained_linear_preds = all_models['linear'].predict(X_test).flatten()
# Utility functions imported from mlutils.py
untrained_linear_scores_df = score_model(untrained_linear_preds, np.array(y_test), index='untrained_linear')
# Add it to the leaderboard
leaderboard_df = score_combine(pd.DataFrame(), untrained_linear_scores_df)
leaderboard_df.head()
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 958us/step
CPU times: total: 31.2 ms
Wall time: 103 ms
Out[6]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
untrained_linear 101.943787 9.748023 0.049686 -13.000124

Train the Linear Model¶

Define Common Fit Options¶
In [7]:
%%time
common_fit_options = {
    'x': X_train,
    'y': y_train,
    'epochs': NUM_EPOCHS,
    'verbose': 0,
    'validation_split': VALIDATION_SPLIT
}

linear_history = all_models['linear'].fit(
    **common_fit_options,
    callbacks=[linear_checkpoint]
)
all_models['linear'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_linear.weights.h5'))
CPU times: total: 1.98 s
Wall time: 7.72 s

Score the Linear Model¶

In [8]:
%%time
linear_preds = all_models['linear'].predict(X_test).flatten()
linear_scores_df = score_model(linear_preds, np.array(y_test), index='linear')
# Add it to the leaderboard
leaderboard_df = score_combine(leaderboard_df, linear_scores_df)
leaderboard_df.head()
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 438us/step
CPU times: total: 31.2 ms
Wall time: 50.6 ms
Out[8]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
untrained_linear 101.943787 9.748023 0.049686 -13.000124
linear 13.669806 3.097228 -0.187996 -2.867148

Neural Network Model¶

Neural Network Architecture¶

We will start with a deep (64-32-16-8-1) neural network, which is likely over-parameterized for this task, and then gradually reduce its complexity in the sections that follow.

  • Input Layer
    • All of the features, please.
  • Normalizer Layer
    • Adapted to all features in the training data.
  • Hidden Layers
    • Four dense layers with ReLU activation, halving in width: 64, 32, 16, and 8 units (64 >> layer_index).
  • Output Layer
    • Layer with one output.

I know what you're thinking: "Why not start with a simpler model?"

My answer to that: This is for science, and we're going to test them all anyway. It's sometimes easier to copy and delete than it is to build from scratch.
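`generate_neural_pyramid` comes from `mlutils` and isn't reproduced in this notebook. Judging from the "64 >> layer_index" note above, a plausible sketch of the width schedule looks like this (the real helper presumably wraps each width in a `keras.layers.Dense(units=width, activation='relu')` layer):

```python
def pyramid_widths(num_hidden_layers: int, num_units: int) -> list:
    """Hidden-layer widths that halve at each step via a right shift.

    Sketch of the assumed mlutils behavior; the real helper presumably
    returns a keras.layers.Dense(units=w, activation='relu') per width.
    """
    return [num_units >> i for i in range(num_hidden_layers)]

# The pyramids explored in this notebook:
print(pyramid_widths(4, 64))  # [64, 32, 16, 8] -> the 64-32-16-8-1 model
print(pyramid_widths(3, 32))  # [32, 16, 8]
print(pyramid_widths(1, 8))   # [8]
```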

In [9]:
%%time
# layer: input - reused from linear model
# layer: normalizer - reused from linear model

# layer(s): hidden (relu) - 64, 32, 16, 8
num_hidden_layers = 4
num_units = 64
layer_deepest_hidden_relu_list = generate_neural_pyramid(num_hidden_layers, num_units)

# layer: output (linear regression)
layer_deepest_output = keras.layers.Dense(units=1)

# architecture:
#   input -> normalizer -> hidden(s) -> dense
all_models['64_32_16_8_1'] = keras.Sequential([
    layer_feature_input,
    layer_feature_normalizer,
    *layer_deepest_hidden_relu_list,
    layer_deepest_output])

all_models['64_32_16_8_1'].summary()
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 64)             │           704 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 32)             │         2,080 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_3 (Dense)                 │ (None, 16)             │           528 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_4 (Dense)                 │ (None, 8)              │           136 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_5 (Dense)                 │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 3,478 (13.59 KB)
 Trainable params: 3,457 (13.50 KB)
 Non-trainable params: 21 (88.00 B)
CPU times: total: 15.6 ms
Wall time: 26 ms

Configure the Neural Network Model¶

  • Optimizer
    • Adam: Adaptive Moment Estimation (Kingma & Ba, 2014)
  • Loss Function
    • Mean Squared Error (MSE)
      • This penalizes larger errors more than smaller errors.
      • We removed outliers in the data cleaning step, so a few extreme crabs should not dominate the quadratic penalty.
  • Callbacks
    • Model Checkpoint
In [10]:
%%time
all_models['64_32_16_8_1'].compile(**common_compile_options())

deepest_checkpoint = keras.callbacks.ModelCheckpoint(
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_64_32_16_8_1.weights.h5'),
    **common_checkpoint_options)
CPU times: total: 0 ns
Wall time: 1 ms

Train the Neural Network Model¶

We're not going to predict with the untrained model, as we already have a random baseline on the leaderboard.

In [11]:
%%time
deepest_history = all_models['64_32_16_8_1'].fit(
    **common_fit_options,
    callbacks=[deepest_checkpoint]
)
all_models['64_32_16_8_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_64_32_16_8_1.weights.h5'))
CPU times: total: 2.41 s
Wall time: 9.09 s

Score the Neural Network Model¶

In [12]:
%%time
deepest_preds = all_models['64_32_16_8_1'].predict(X_test).flatten()
deepest_scores_df = score_model(deepest_preds, np.array(y_test), index='64_32_16_8_1')
# Add it to the leaderboard
leaderboard_df = score_combine(leaderboard_df, deepest_scores_df)
leaderboard_df.head()
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step 
CPU times: total: 93.8 ms
Wall time: 109 ms
Out[12]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
untrained_linear 101.943787 9.748023 0.049686 -13.000124
linear 13.669806 3.097228 -0.187996 -2.867148
64_32_16_8_1 3.746227 1.420128 0.202596 0.202460

Neural Network Model (32-16-8-1)¶

Let's cut the first layer out and see if it still has what it takes.

In [13]:
%%time
# layer: input - reused from linear model
# layer: normalizer - reused from linear model

# layer(s): hidden (relu) - 32, 16, 8
num_hidden_layers = 3
num_units = 32
layer_32_16_8_hidden_relu_list = generate_neural_pyramid(num_hidden_layers, num_units)

# layer: output (linear regression)
layer_32_16_8_output = keras.layers.Dense(units=1)

# architecture:
#   input -> normalizer -> hidden(s) -> dense
all_models['32_16_8_1'] = keras.Sequential([
    layer_feature_input,
    layer_feature_normalizer,
    *layer_32_16_8_hidden_relu_list,
    layer_32_16_8_output
])

all_models['32_16_8_1'].summary()
Model: "sequential_2"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_6 (Dense)                 │ (None, 32)             │           352 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_7 (Dense)                 │ (None, 16)             │           528 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_8 (Dense)                 │ (None, 8)              │           136 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_9 (Dense)                 │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 1,046 (4.09 KB)
 Trainable params: 1,025 (4.00 KB)
 Non-trainable params: 21 (88.00 B)
CPU times: total: 0 ns
Wall time: 19 ms

Configure the (32-16-8-1) Neural Network Model¶

In [14]:
%%time
all_models['32_16_8_1'].compile(**common_compile_options())

deep_32_16_8_checkpoint = keras.callbacks.ModelCheckpoint(
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_32_16_8_1.weights.h5'),
    **common_checkpoint_options)
CPU times: total: 0 ns
Wall time: 2 ms

Train the (32-16-8-1) Neural Network Model¶

In [15]:
%%time
deep_32_16_8_history = all_models['32_16_8_1'].fit(
    **common_fit_options,
    callbacks=[deep_32_16_8_checkpoint]
)
all_models['32_16_8_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_32_16_8_1.weights.h5'))
CPU times: total: 2.77 s
Wall time: 8.75 s

Score the (32-16-8-1) Neural Network Model¶

In [16]:
%%time
deep_32_16_8_preds = all_models['32_16_8_1'].predict(X_test).flatten()
deep_32_16_8_scores_df = score_model(deep_32_16_8_preds, np.array(y_test), index='32_16_8_1')
# Add it to the leaderboard
leaderboard_df = score_combine(leaderboard_df, deep_32_16_8_scores_df)
leaderboard_df.head()
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step 
CPU times: total: 46.9 ms
Wall time: 103 ms
Out[16]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
untrained_linear 101.943787 9.748023 0.049686 -13.000124
linear 13.669806 3.097228 -0.187996 -2.867148
64_32_16_8_1 3.746227 1.420128 0.202596 0.202460
32_16_8_1 3.892990 1.436031 0.174743 0.174401

Neural Network Model (16-8-1)¶

The last one held up, so let's reduce it even more.

In [17]:
%%time
# layer: input - reused from linear model
# layer: normalizer - reused from linear model

# layer(s): hidden (relu) - 16, 8
num_hidden_layers = 2
num_units = 16
layer_16_8_hidden_relu_list = generate_neural_pyramid(num_hidden_layers, num_units)

# layer: output (linear regression)
layer_16_8_output = keras.layers.Dense(units=1)

# architecture:
#   input -> normalizer -> hidden(s) -> dense
all_models['16_8_1'] = keras.Sequential([
    layer_feature_input,
    layer_feature_normalizer,
    *layer_16_8_hidden_relu_list,
    layer_16_8_output])

all_models['16_8_1'].summary()
Model: "sequential_3"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_10 (Dense)                │ (None, 16)             │           176 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_11 (Dense)                │ (None, 8)              │           136 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_12 (Dense)                │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 342 (1.34 KB)
 Trainable params: 321 (1.25 KB)
 Non-trainable params: 21 (88.00 B)
CPU times: total: 0 ns
Wall time: 19 ms

Configure the (16-8-1) Neural Network Model¶

In [18]:
%%time
all_models['16_8_1'].compile(**common_compile_options())

deep_16_8_checkpoint = keras.callbacks.ModelCheckpoint(
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_16_8.weights.h5'),
    **common_checkpoint_options
)
CPU times: total: 0 ns
Wall time: 2.51 ms

Train the (16-8-1) Neural Network Model¶

In [19]:
%%time
deep_16_8_history = all_models['16_8_1'].fit(
    **common_fit_options,
    callbacks=[deep_16_8_checkpoint]
)
all_models['16_8_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_16_8.weights.h5'))
CPU times: total: 3.41 s
Wall time: 8.32 s

Score the (16-8-1) Neural Network Model¶

In [20]:
%%time
deep_16_8_preds = all_models['16_8_1'].predict(X_test).flatten()
deep_16_8_scores_df = score_model(deep_16_8_preds, np.array(y_test), index='16_8_1')
# Add it to the leaderboard
leaderboard_df = score_combine(leaderboard_df, deep_16_8_scores_df)
leaderboard_df.head()
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
CPU times: total: 62.5 ms
Wall time: 87.8 ms
Out[20]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
untrained_linear 101.943787 9.748023 0.049686 -13.000124
linear 13.669806 3.097228 -0.187996 -2.867148
64_32_16_8_1 3.746227 1.420128 0.202596 0.202460
32_16_8_1 3.892990 1.436031 0.174743 0.174401
16_8_1 3.791295 1.422898 0.166290 0.165454

Neural Network Model (8-1)¶

The last reduction didn't lose too much accuracy, so let's continue removing layers.

In [21]:
%%time
# layer: input - reused from linear model
# layer: normalizer - reused from linear model

# layer(s): hidden (relu) - 8
num_hidden_layers = 1
num_units = 8
layer_8_hidden_relu_list = generate_neural_pyramid(num_hidden_layers, num_units)

# layer: output (linear regression)
layer_8_output = keras.layers.Dense(units=1)

# architecture:
#   input -> normalizer -> hidden(s) -> dense
all_models['8_1'] = keras.Sequential([
    layer_feature_input,
    layer_feature_normalizer,
    *layer_8_hidden_relu_list,
    layer_8_output])

all_models['8_1'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_13 (Dense)                │ (None, 8)              │            88 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_14 (Dense)                │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 118 (476.00 B)
 Trainable params: 97 (388.00 B)
 Non-trainable params: 21 (88.00 B)
CPU times: total: 0 ns
Wall time: 12.5 ms

Configure the (8-1) Neural Network Model¶

In [22]:
%%time
all_models['8_1'].compile(**common_compile_options())

deep_8_checkpoint = keras.callbacks.ModelCheckpoint(
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8.weights.h5'),
    **common_checkpoint_options
)
CPU times: total: 0 ns
Wall time: 2 ms

Train the (8-1) Neural Network Model¶

In [23]:
%%time
deep_8_history = all_models['8_1'].fit(
    **common_fit_options,
    callbacks=[deep_8_checkpoint]
)
all_models['8_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8.weights.h5'))
CPU times: total: 2.45 s
Wall time: 7.92 s

Score the (8-1) Neural Network Model¶

In [24]:
%%time
deep_8_preds = all_models['8_1'].predict(X_test).flatten()
deep_8_scores_df = score_model(deep_8_preds, np.array(y_test), index='8_1')
# Add it to the leaderboard
leaderboard_df = score_combine(leaderboard_df, deep_8_scores_df)
leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
CPU times: total: 15.6 ms
Wall time: 79.5 ms
Out[24]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
untrained_linear 101.943787 9.748023 0.049686 -13.000124
linear 13.669806 3.097228 -0.187996 -2.867148
64_32_16_8_1 3.746227 1.420128 0.202596 0.202460
32_16_8_1 3.892990 1.436031 0.174743 0.174401
16_8_1 3.791295 1.422898 0.166290 0.165454
8_1 3.994874 1.469987 0.151188 0.151183

Neural Network Model (4-1)¶

Still not too shabby. Let's reduce the last hidden layer to 4 neurons.

In [25]:
%%time
# layer: input - reused from linear model
# layer: normalizer - reused from linear model

# layer(s): hidden (relu) - 4
num_hidden_layers = 1
num_units = 4
layer_4_hidden_relu_list = generate_neural_pyramid(num_hidden_layers, num_units)

# layer: output (linear regression)
layer_4_output = keras.layers.Dense(units=1)

# architecture:
#   input -> normalizer -> hidden(s) -> dense
all_models['4_1'] = keras.Sequential([
    layer_feature_input,
    layer_feature_normalizer,
    *layer_4_hidden_relu_list,
    layer_4_output])

all_models['4_1'].summary()
Model: "sequential_5"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_15 (Dense)                │ (None, 4)              │            44 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_16 (Dense)                │ (None, 1)              │             5 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 70 (284.00 B)
 Trainable params: 49 (196.00 B)
 Non-trainable params: 21 (88.00 B)
CPU times: total: 0 ns
Wall time: 14 ms

Configure the (4-1) Neural Network Model¶

In [26]:
%time
all_models['4_1'].compile(**common_compile_options())

deep_4_checkpoint = keras.callbacks.ModelCheckpoint(
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_4.weights.h5'),
    **common_checkpoint_options
)
CPU times: total: 0 ns
Wall time: 0 ns

Train the (4-1) Neural Network Model¶

In [27]:
%%time
deep_4_history = all_models['4_1'].fit(
    **common_fit_options,
    callbacks=[deep_4_checkpoint]
)
all_models['4_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_4.weights.h5'))
CPU times: total: 3.09 s
Wall time: 7.97 s

Score the (4-1) Neural Network Model¶

In [28]:
%%time
deep_4_preds = all_models['4_1'].predict(X_test).flatten()
deep_4_scores_df = score_model(deep_4_preds, np.array(y_test), index='4_1')
# Add it to the leaderboard
leaderboard_df = score_combine(leaderboard_df, deep_4_scores_df)
leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
CPU times: total: 15.6 ms
Wall time: 82.3 ms
Out[28]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
untrained_linear 101.943787 9.748023 0.049686 -13.000124
linear 13.669806 3.097228 -0.187996 -2.867148
64_32_16_8_1 3.746227 1.420128 0.202596 0.202460
32_16_8_1 3.892990 1.436031 0.174743 0.174401
16_8_1 3.791295 1.422898 0.166290 0.165454
8_1 3.994874 1.469987 0.151188 0.151183
4_1 7.346774 2.011863 0.148698 0.058166

Neural Network Model (2-1)¶

The (4-1) model lost a fair bit of accuracy, but for completeness let's shrink the hidden layer one more step, to 2 neurons.

In [29]:
%%time
# layer: input - reused from linear model
# layer: normalizer - reused from linear model

# layer(s): hidden (relu) - 2
num_hidden_layers = 1
num_units = 2
layer_2_hidden_relu_list = generate_neural_pyramid(num_hidden_layers, num_units)

# layer: output (linear regression)
layer_2_output = keras.layers.Dense(units=1)

# architecture:
#   input -> normalizer -> hidden(s) -> dense
all_models['2_1'] = keras.Sequential([
    layer_feature_input,
    layer_feature_normalizer,
    *layer_2_hidden_relu_list,
    layer_2_output])

all_models['2_1'].summary()
Model: "sequential_6"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_17 (Dense)                │ (None, 2)              │            22 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_18 (Dense)                │ (None, 1)              │             3 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 46 (188.00 B)
 Trainable params: 25 (100.00 B)
 Non-trainable params: 21 (88.00 B)
CPU times: total: 0 ns
Wall time: 15 ms

Configure the (2-1) Neural Network Model¶

In [30]:
%%time
all_models['2_1'].compile(**common_compile_options())

deep_2_checkpoint = keras.callbacks.ModelCheckpoint(
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_2.weights.h5'),
    **common_checkpoint_options
)
CPU times: total: 0 ns
Wall time: 1 ms

Train the (2-1) Neural Network Model¶

In [31]:
%%time
deep_2_history = all_models['2_1'].fit(
    **common_fit_options,
    callbacks=[deep_2_checkpoint]
)
all_models['2_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_2.weights.h5'))
CPU times: total: 2.5 s
Wall time: 8 s

Score the (2-1) Neural Network Model¶

In [32]:
%%time
deep_2_preds = all_models['2_1'].predict(X_test).flatten()
deep_2_scores_df = score_model(deep_2_preds, np.array(y_test), index='2_1')
# Add it to the leaderboard
leaderboard_df = score_combine(leaderboard_df, deep_2_scores_df)
leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
CPU times: total: 15.6 ms
Wall time: 81.2 ms
Out[32]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
untrained_linear 101.943787 9.748023 0.049686 -13.000124
linear 13.669806 3.097228 -0.187996 -2.867148
64_32_16_8_1 3.746227 1.420128 0.202596 0.202460
32_16_8_1 3.892990 1.436031 0.174743 0.174401
16_8_1 3.791295 1.422898 0.166290 0.165454
8_1 3.994874 1.469987 0.151188 0.151183
4_1 7.346774 2.011863 0.148698 0.058166
2_1 7.742756 2.080535 0.144014 0.050772

We're seeing clear signs of degradation in the (4-1) and (2-1) models. Let's see how they all compare.

True vs Predicted Age Scatter Plots¶

These plots give us a good view of how well each model predicts the age of the crabs.

In [33]:
%%time
all_preds = {
    'untrained_linear': {'true': y_test, 'pred': untrained_linear_preds},
    'linear': {'true': y_test, 'pred': linear_preds},
    '64_32_16_8_1': {'true': y_test, 'pred': deepest_preds},
    '32_16_8_1': {'true': y_test, 'pred': deep_32_16_8_preds},
    '16_8_1': {'true': y_test, 'pred': deep_16_8_preds},
    '8_1': {'true': y_test, 'pred': deep_8_preds},
    '4_1': {'true': y_test, 'pred': deep_4_preds},
    '2_1': {'true': y_test, 'pred': deep_2_preds}
}

plot_true_vs_pred_from_dict(all_preds, show_target_line=True)
CPU times: total: 31.2 ms
Wall time: 53.1 ms
No description has been provided for this image

True vs Predicted Age Scatter Plot Observations¶

Neat!

***Note**: The line of truth is shown in green.*

Untrained Linear Model¶
  • Very bad.
    • As usual.
Linear Model¶
  • Guesses are lower than the actual crab ages.
    • Older crabs may not be harvested soon enough.
Neural Network Model (64-32-16-8-1)¶
Neural Network Model (32-16-8-1)¶
Neural Network Model (16-8-1)¶
Neural Network Model (8-1)¶
  • All looking good.
    • Some middle-aged crabs are guessed to be older, but this makes sense since crabs stop growing as much after a certain age.
Neural Network Model (4-1)¶
  • Something strange going on here.
    • This model predicts that a disproportionate number of crabs are 5 years old.
Neural Network Model (2-1)¶
  • Visually similar to the other neural network models.
    • The scores show it is making predictions further from the truth.

Training Loss Over Time Plots¶

Now we'll show the training loss over time. This gives us insight into how quickly the model is learning. It can also show us if the model is overfitting or not.

Training loss should decrease over time; if the validation loss starts to rise while the training loss keeps falling, the model is overfitting.
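This rule of thumb can be checked programmatically. Here's a minimal sketch (`overfit_report` is a hypothetical helper, not part of mlutils) that inspects a Keras-style history dict and flags when validation loss has drifted above its best value while training loss kept falling:

```python
def overfit_report(history, tolerance=0.05):
    """Flag overfitting from a Keras-style history dict.

    `history` maps metric names to per-epoch values, as in
    `model.fit(...).history`. We call the run "overfitting" when the
    final validation loss drifts more than `tolerance` (relative)
    above its best (minimum) value while training loss kept falling.
    """
    val = history['val_loss']
    train = history['loss']
    best_epoch = min(range(len(val)), key=val.__getitem__)
    drift = (val[-1] - val[best_epoch]) / val[best_epoch]
    still_improving = train[-1] < train[best_epoch]
    return {
        'best_epoch': best_epoch,
        'val_drift': drift,
        'overfitting': drift > tolerance and still_improving,
    }

# Synthetic example: training loss keeps falling, validation loss turns up.
demo = {
    'loss':     [9.0, 4.0, 3.0, 2.5, 2.2, 2.0],
    'val_loss': [9.5, 4.5, 3.6, 3.5, 3.8, 4.2],
}
```

Running `overfit_report(demo)` flags the run as overfitting: validation loss bottomed out at epoch 3 and has drifted 20% above that minimum since.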

In [34]:
%%time
all_histories = {
    'linear': linear_history,
    '64_32_16_8_1': deepest_history,
    '32_16_8_1': deep_32_16_8_history,
    '16_8_1': deep_16_8_history,
    '8_1': deep_8_history,
    '4_1': deep_4_history,
    '2_1': deep_2_history
}

plot_training_loss_from_dict(all_histories)
CPU times: total: 31.2 ms
Wall time: 53.1 ms
No description has been provided for this image

Training Loss Over Time Observations¶

Pretty cool, huh?

***Note**: These models have some overhead involved in training, so it's not as simple as "more neurons = better". Sometimes a simple ML algorithm can do the trick in milliseconds.*

Linear Model¶
  • Never even showed up to the party.
  • Exceeds a Mean Squared Error of 10.
Neural Network Model (64-32-16-8-1)¶
  • Clearly overfitting already.
  • Gets the gist quickly.
Neural Network Model (32-16-8-1)¶
  • Looking good.
  • Also gets to the gist quickly.
Neural Network Model (16-8-1)¶
  • Similar to the (32-16-8-1) model.
    • Less variance in the training loss.
Neural Network Model (8-1)¶
  • The curve is smoothing out.
Neural Network Model (4-1)¶
  • Not as quick to converge.
Neural Network Model (2-1)¶
  • Lagging behind.
    • Perhaps more epochs will give this model a chance.

Re-Train the Models Again¶

Let's start over, but this time for longer.

Give them 5x as many epochs this time.

Linear Model¶
In [35]:
%%time
# give each model 5x as many epochs this time
common_fit_options['epochs'] = NUM_EPOCHS*5

# reset the linear model
all_models['linear'] = keras.models.clone_model(all_models['linear'])
all_models['linear'].compile(**common_compile_options())

all_histories.update({'linear': all_models['linear'].fit(
    **common_fit_options,
    callbacks=[linear_checkpoint])})

all_models['linear'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_linear.weights.h5'))

plot_training_loss(all_histories['linear'], 'Linear Model')
CPU times: total: 7.52 s
Wall time: 35 s
No description has been provided for this image
Neural Network Model (64-32-16-8-1)¶
In [36]:
%%time
# reset the (64-32-16-8-1) model
all_models['64_32_16_8_1'] = keras.models.clone_model(all_models['64_32_16_8_1'])
all_models['64_32_16_8_1'].compile(**common_compile_options())

all_histories.update({'64_32_16_8_1':
    all_models['64_32_16_8_1'].fit(
        **common_fit_options,
        callbacks=[deepest_checkpoint])})

all_models['64_32_16_8_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_64_32_16_8_1.weights.h5'))

plot_training_loss(all_histories['64_32_16_8_1'], '64-32-16-8-1 NN Model')
CPU times: total: 8.88 s
Wall time: 40.4 s
No description has been provided for this image

(64-32-16-8-1) is definitely overfitting. Let's try the next one.

Neural Network Model (32-16-8-1)¶
In [37]:
%%time
# reset the (32-16-8-1) model
all_models['32_16_8_1'] = keras.models.clone_model(all_models['32_16_8_1'])
all_models['32_16_8_1'].compile(**common_compile_options())

all_histories.update({'32_16_8_1':
    all_models['32_16_8_1'].fit(
        **common_fit_options,
        callbacks=[deep_32_16_8_checkpoint])})

all_models['32_16_8_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_32_16_8_1.weights.h5'))

plot_training_loss(all_histories['32_16_8_1'], '32-16-8-1 NN Model')
CPU times: total: 9.72 s
Wall time: 38.6 s
No description has been provided for this image

(32-16-8-1) is still overfitting. Let's keep going.

Neural Network Model (16-8-1)¶
In [38]:
%%time
# reset the (16-8-1) model
all_models['16_8_1'] = keras.models.clone_model(all_models['16_8_1'])
all_models['16_8_1'].compile(**common_compile_options())

all_histories.update({'16_8_1':
    all_models['16_8_1'].fit(
        **common_fit_options,
        callbacks=[deep_16_8_checkpoint])})

all_models['16_8_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_16_8.weights.h5'))

plot_training_loss(all_histories['16_8_1'], '16-8-1 NN Model')
CPU times: total: 7.83 s
Wall time: 37.4 s
No description has been provided for this image

Validation loss is remaining steady, and the training loss is decreasing ever so slightly. It might be overfitting, but it's hard to tell.

Neural Network Model (8-1)¶
In [39]:
%%time
# reset the (8-1) model
all_models['8_1'] = keras.models.clone_model(all_models['8_1'])
all_models['8_1'].compile(**common_compile_options())

all_histories.update({'8_1':
    all_models['8_1'].fit(
        **common_fit_options,
        callbacks=[deep_8_checkpoint])})

all_models['8_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8.weights.h5'))

plot_training_loss(all_histories['8_1'], '8-1 NN Model')
CPU times: total: 8.72 s
Wall time: 37.2 s
No description has been provided for this image

(8-1) doesn't seem to be overfitting. Let's keep it in mind.

Neural Network Model (4-1)¶
In [40]:
%%time
# reset the (4-1) model
all_models['4_1'] = keras.models.clone_model(all_models['4_1'])
all_models['4_1'].compile(**common_compile_options())

all_histories.update({'4_1':
    all_models['4_1'].fit(
        **common_fit_options,
        callbacks=[deep_4_checkpoint])})

all_models['4_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_4.weights.h5'))

plot_training_loss(all_histories['4_1'], '4-1 NN Model')
CPU times: total: 7.02 s
Wall time: 36.7 s
No description has been provided for this image

(4-1) looks pretty good!

Neural Network Model (2-1)¶
In [41]:
%%time
# reset the (2-1) model
all_models['2_1'] = keras.models.clone_model(all_models['2_1'])
all_models['2_1'].compile(**common_compile_options())

all_histories.update({'2_1':
    all_models['2_1'].fit(
        **common_fit_options,
        callbacks=[deep_2_checkpoint])})

all_models['2_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_2.weights.h5'))

plot_training_loss(all_histories['2_1'], '2-1 NN Model')
CPU times: total: 8.72 s
Wall time: 36.7 s
No description has been provided for this image

Re-Plot Training Loss Over Time¶

Over 500 epochs, we can see how the models are learning.

In [42]:
%%time
plot_training_loss_from_dict(all_histories)
CPU times: total: 31.2 ms
Wall time: 50.6 ms
No description has been provided for this image

Training Loss Over More Time Observations¶

Cool stuff!

Linear Model¶
  • Finally showed up to the party.
  • Converges to an MSE of ~4, right alongside the (2-1) model.
Neural Network Model (64-32-16-8-1)¶
  • Obviously overfitting.
Neural Network Model (32-16-8-1)¶
  • Also overfitting.
Neural Network Model (16-8-1)¶
Neural Network Model (8-1)¶
  • Similar to the more complex neural networks.
    • Less variance as the number of neurons decreases.
Neural Network Model (4-1)¶
  • After a bumpy start, it got the hang of it.
    • Less variance in the training and validation loss.
Neural Network Model (2-1)¶
  • It never caught up.
  • But it's not overfitting so much, so that's good.

***Note**: Implementing Early Stopping on these models resulted in early terminations in most cases.*
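For reference, early stopping is just a patience counter on the validation loss. A minimal sketch of the logic behind `keras.callbacks.EarlyStopping` (simplified; the real callback also handles metric modes, baselines, and restoring the best weights):

```python
def early_stop_epoch(val_losses, patience=10, min_delta=0.0):
    """Return the epoch training would halt at under early stopping.

    Mirrors the core logic of keras.callbacks.EarlyStopping: stop once
    `patience` epochs pass without val_loss improving by `min_delta`.
    Returns len(val_losses) if the stop is never triggered.
    """
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)
```

With noisy, plateauing validation curves like ours, a small patience trips the counter long before 500 epochs, which is why early stopping cut most of these runs short.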

Model Leaderboard¶

Let's re-score the models and see how they compare.

My Criteria¶

  • Mean Absolute Error within 2 years.
  • Reasonable Explained Variance Score
  • Reasonable R2 Score
  • Avoid Overfitting
  • Reasonable Learning Rate
In [43]:
%%time
# score each model
all_models = {
    'linear': all_models['linear'],
    '64_32_16_8_1': all_models['64_32_16_8_1'],
    '32_16_8_1': all_models['32_16_8_1'],
    '16_8_1': all_models['16_8_1'],
    '8_1': all_models['8_1'],
    '4_1': all_models['4_1'],
    '2_1': all_models['2_1']
}

# score on the test set
for model_name, model in all_models.items():
    preds = model.predict(X_test).flatten()
    scores_df = score_model(preds, np.array(y_test), index=model_name)
    leaderboard_df = score_combine(leaderboard_df, scores_df)

# carry over the untrained linear model's scores - it doesn't get another chance here, for time's sake
training_leaderboard_df = leaderboard_df.loc[['untrained_linear']]
# score on the training set
for model_name, model in all_models.items():
    preds = model.predict(X_train).flatten()
    scores_df = score_model(preds, np.array(y_train), index=model_name+'_train')
    training_leaderboard_df = score_combine(training_leaderboard_df, scores_df)

leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 914us/step
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step 
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
95/95 ━━━━━━━━━━━━━━━━━━━━ 0s 402us/step
95/95 ━━━━━━━━━━━━━━━━━━━━ 0s 460us/step
95/95 ━━━━━━━━━━━━━━━━━━━━ 0s 484us/step
95/95 ━━━━━━━━━━━━━━━━━━━━ 0s 478us/step
95/95 ━━━━━━━━━━━━━━━━━━━━ 0s 466us/step
95/95 ━━━━━━━━━━━━━━━━━━━━ 0s 456us/step
95/95 ━━━━━━━━━━━━━━━━━━━━ 0s 463us/step
CPU times: total: 453 ms
Wall time: 1.34 s
Out[43]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
untrained_linear 101.943787 9.748023 0.049686 -13.000124
linear 3.997827 1.473956 0.011232 0.010781
64_32_16_8_1 3.630893 1.404292 0.302562 0.302399
32_16_8_1 3.602257 1.385743 0.338213 0.337592
16_8_1 3.807182 1.415440 0.280600 0.279393
8_1 3.794136 1.432214 0.228980 0.228786
4_1 3.901053 1.461622 0.178054 0.177953
2_1 3.946111 1.468480 0.044709 0.044348

Test Set Leaderboard Observations¶

Everyone but the untrained linear model did pretty well. Let's see how they did on the training set.

In [44]:
%%time
training_leaderboard_df[:]
CPU times: total: 0 ns
Wall time: 0 ns
Out[44]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
untrained_linear 101.943787 9.748023 0.049686 -13.000124
linear_train 3.958444 1.475843 0.047926 0.047916
64_32_16_8_1_train 3.003210 1.266670 0.452759 0.452592
32_16_8_1_train 3.251329 1.309139 0.419088 0.418564
16_8_1_train 3.341644 1.325369 0.342692 0.342082
8_1_train 3.571968 1.389613 0.268823 0.268768
4_1_train 3.719555 1.432148 0.233845 0.233813
2_1_train 3.923570 1.470982 0.080701 0.080701

Training Set Leaderboard Observations¶

Everyone did better, as expected. Hopefully, they didn't do too much better. That would signal overfitting.

Putting it All Together¶

In [45]:
%%time
combined_leaderboard_df = score_combine(leaderboard_df, training_leaderboard_df).sort_index()
combined_leaderboard_df[:]
CPU times: total: 0 ns
Wall time: 1e+03 µs
Out[45]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
16_8_1 3.807182 1.415440 0.280600 0.279393
16_8_1_train 3.341644 1.325369 0.342692 0.342082
2_1 3.946111 1.468480 0.044709 0.044348
2_1_train 3.923570 1.470982 0.080701 0.080701
32_16_8_1 3.602257 1.385743 0.338213 0.337592
32_16_8_1_train 3.251329 1.309139 0.419088 0.418564
4_1 3.901053 1.461622 0.178054 0.177953
4_1_train 3.719555 1.432148 0.233845 0.233813
64_32_16_8_1 3.630893 1.404292 0.302562 0.302399
64_32_16_8_1_train 3.003210 1.266670 0.452759 0.452592
8_1 3.794136 1.432214 0.228980 0.228786
8_1_train 3.571968 1.389613 0.268823 0.268768
linear 3.997827 1.473956 0.011232 0.010781
linear_train 3.958444 1.475843 0.047926 0.047916
untrained_linear 101.943787 9.748023 0.049686 -13.000124

Reminder of Our Metrics¶

Mean Squared Error¶

  • The best score is 0.0
  • Lower is better.

Mean Absolute Error¶

  • The best score is 0.0
  • Lower is better.
  • Less sensitive to outliers.

Explained Variance Score¶

  • The best score is 1.0
  • Lower is worse.

R2 Score¶

  • The best score is 1.0
  • Lower is worse.
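All four metrics can be computed directly from their definitions (the formulas below match scikit-learn's; `score_summary` is a hypothetical stand-in for our `score_model` helper). Note how a constant bias is what separates Explained Variance from R2:

```python
import numpy as np

def score_summary(y_true, y_pred):
    """Compute the four leaderboard metrics from their definitions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    # Explained variance ignores a constant bias in the predictions...
    evs = 1 - np.var(err) / np.var(y_true)
    # ...while R2 penalizes it.
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {'mse': mse, 'mae': mae, 'evs': evs, 'r2': r2}

# A biased predictor: always one year too low.
scores = score_summary([8, 10, 12], [7, 9, 11])
```

A predictor that is always exactly one year low still earns a perfect Explained Variance of 1.0 but an R2 of only 0.625, which is why we track both.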

Model Type Comparison¶

***Note**: The untrained linear model is excluded from these graphs for clarity.*

R2 and Explained Variance Scores¶

  • Explained Variance Score
  • R2 Score
In [99]:
%%time
clarified_leaderboard_df = leaderboard_df.drop('untrained_linear')[['r2_score', 'explained_variance_score']]
clarified_leaderboard_df.plot(kind='bar', title='Feature-Rich vs Deep Learning Model R2 Scores', figsize=(20, 10))
CPU times: total: 0 ns
Wall time: 21.6 ms
Out[99]:
<Axes: title={'center': 'Feature-Rich vs Deep Learning Model R2 Scores'}>
No description has been provided for this image

Mean Squared and Absolute Errors¶

In [101]:
%%time
clarified_leaderboard_df = leaderboard_df.drop('untrained_linear')[['mean_squared_error', 'mean_absolute_error']]
clarified_leaderboard_df.plot(kind='bar', title='Feature-Rich vs Deep Learning Model MSE Scores', figsize=(20, 10))
CPU times: total: 0 ns
Wall time: 21 ms
Out[101]:
<Axes: title={'center': 'Feature-Rich vs Deep Learning Model MSE Scores'}>
No description has been provided for this image

Score Comparison Observations¶

Neural Network Model (64-32-16-8-1)¶

(64-32-16-8-1) is definitely overfitting.

Neural Network Model (32-16-8-1)¶

(32-16-8-1) is overfitting.

Neural Network Model (16-8-1)¶

(16-8-1) is overfitting.

Neural Network Model (8-1)¶

(8-1) is not overfitting too much.

Neural Network Model (4-1)¶

(4-1) is not overfitting too much either.

Neural Network Model (2-1)¶

(2-1) is not overfitting too much either.

Show the Leaderboard Again¶

In [47]:
%%time
leaderboard_df[:]
CPU times: total: 0 ns
Wall time: 0 ns
Out[47]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
untrained_linear 101.943787 9.748023 0.049686 -13.000124
linear 3.997827 1.473956 0.011232 0.010781
64_32_16_8_1 3.630893 1.404292 0.302562 0.302399
32_16_8_1 3.602257 1.385743 0.338213 0.337592
16_8_1 3.807182 1.415440 0.280600 0.279393
8_1 3.794136 1.432214 0.228980 0.228786
4_1 3.901053 1.461622 0.178054 0.177953
2_1 3.946111 1.468480 0.044709 0.044348

On Training Data¶

Hopefully they did not do much better than their test counterparts.

In [48]:
%%time
training_leaderboard_df[:]
CPU times: total: 0 ns
Wall time: 0 ns
Out[48]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
untrained_linear 101.943787 9.748023 0.049686 -13.000124
linear_train 3.958444 1.475843 0.047926 0.047916
64_32_16_8_1_train 3.003210 1.266670 0.452759 0.452592
32_16_8_1_train 3.251329 1.309139 0.419088 0.418564
16_8_1_train 3.341644 1.325369 0.342692 0.342082
8_1_train 3.571968 1.389613 0.268823 0.268768
4_1_train 3.719555 1.432148 0.233845 0.233813
2_1_train 3.923570 1.470982 0.080701 0.080701

Score These Scores¶

Why not?

These scores measure how similar each model's metrics on the test set are to its metrics on the training set.

This could be a good way to see if the model is overfitting or underfitting.

In [49]:
%%time
score_score_leaderboard_df = pd.DataFrame()

for model_name in leaderboard_df.index:
    if model_name == 'untrained_linear':
        continue
    score_score_leaderboard_df = score_combine(
        score_score_leaderboard_df,
        score_model(
            leaderboard_df.loc[[model_name]].transpose(),
            training_leaderboard_df.loc[[f'{model_name}_train']].transpose(), index=model_name
        )
    )

score_score_leaderboard_df[:]
CPU times: total: 0 ns
Wall time: 39.1 ms
Out[49]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
linear 0.001070 0.028775 0.999628 0.999597
64_32_16_8_1 0.114511 0.266424 0.945298 0.937982
32_16_8_1 0.035529 0.147345 0.982482 0.979998
16_8_1 0.058156 0.170098 0.977551 0.971957
8_1 0.013590 0.086148 0.994594 0.993585
4_1 0.010011 0.080656 0.995934 0.995667
2_1 0.000783 0.024347 0.999759 0.999692

Choose the Best Architecture for the Job¶

Those pesky crabs don't want us to know how old they are. We'll find out soon enough.

First, let's choose the architecture to tune.

My Criteria¶

  • Mean Absolute Error within 2 years.
  • Reasonable Explained Variance Score
  • Reasonable R2 Score
  • Avoid Overfitting
  • Reasonable Learning Rate

Based on low MSE, high R2, and high Explained Variance, my choice is the (8-1) neural network architecture.

Pursue the (8-1) Neural Network Architecture¶

Let's try some hyperparameter tuning on the (8-1) neural network model.

Why Not the (4-1) Neural Network Architecture?¶

Despite the (4-1) neural network model performing better over 500 epochs, it made some strange predictions after only 100 epochs.

In the interest of time during hyperparameter tuning, we'll stick with the (8-1) neural network model, since it performs well and trains to an acceptable level faster.

Hyperparameter Tuning¶

Next, we will tune the hyperparameters of the (8-1) neural network model.

Hyperparameters¶

  • Optimizers (adam, nadam, rmsprop, sgd, adagrad, adadelta, adamax)
  • Learning rates (0.1, 0.01, 0.001, 0.0001, etc.)
  • Loss functions (mean_squared_error, mean_absolute_error, etc.)
Let's reset the number of epochs to the original value.¶

This saves some time, since not much progress was made with the extra epochs.

In [50]:
%%time
common_fit_options['epochs'] = NUM_EPOCHS  # back to the original 100
CPU times: total: 0 ns
Wall time: 0 ns

Optimizer Tuning¶

Next we'll try compiling the (8-1) neural network model with different optimizers to look for any improvements.

We'll try the following optimizers:

  • Adam
  • Nadam
  • RMSprop
  • Stochastic Gradient Descent (SGD)
  • Adagrad
  • Adadelta
  • Adamax

Adam Optimizer¶

We have already been using the Adam optimizer, but let's try it again to get a baseline.

Adam is a popular optimizer that combines the best of Adagrad and RMSprop.

Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments. According to Kingma et al., 2014, the method is "computationally efficient, has little memory requirement, invariant to diagonal rescaling of gradients, and is well suited for problems that are large in terms of data/parameters".
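Those first- and second-order moment estimates translate to a short update rule. A simplified scalar sketch of one Adam step (parameter names follow the paper's defaults, which Keras also uses; the real implementation adds extras like AMSGrad and weight decay):

```python
def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a scalar weight.

    m and v are running estimates of the gradient's first moment (mean)
    and second moment (uncentered variance); t is the 1-based step
    count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected mean
    v_hat = v / (1 - beta2 ** t)   # bias-corrected variance
    w -= lr * m_hat / (v_hat ** 0.5 + eps)
    return w, m, v
```

On the very first step the bias correction makes the update size roughly `lr` regardless of the gradient's magnitude, which is part of why Adam is so forgiving about feature and gradient scaling.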

In [51]:
%%time
all_models['8_1_Adam'] = keras.models.clone_model(all_models['8_1'])

compile_options = common_compile_options()
compile_options['optimizer'] = keras.optimizers.Adam()

all_models['8_1_Adam'].compile(**compile_options)

deep_8_Adam_checkpoint = keras.callbacks.ModelCheckpoint(
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_Adam.weights.h5'),
    **common_checkpoint_options)

# initialize history dictionary
optimizer_histories = {'8_1_Adam': \
    all_models['8_1_Adam'].fit(
        **common_fit_options,
        callbacks=[deep_8_Adam_checkpoint])}

all_models['8_1_Adam'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_Adam.weights.h5'))

all_models['8_1_Adam'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_13 (Dense)                │ (None, 8)              │            88 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_14 (Dense)                │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 314 (1.23 KB)
 Trainable params: 97 (388.00 B)
 Non-trainable params: 21 (88.00 B)
 Optimizer params: 196 (788.00 B)
CPU times: total: 1.95 s
Wall time: 8.29 s
Adam Optimizer Training Loss Plot¶
In [52]:
%%time
plot_training_loss(optimizer_histories['8_1_Adam'], '8-1 NN Model (Adam)')
CPU times: total: 0 ns
Wall time: 8.51 ms
No description has been provided for this image
Adam Optimizer Score¶
In [53]:
%%time
chosen_arch_preds = {} # initialize prediction dictionary
chosen_arch_preds.update({'8_1_Adam': all_models['8_1_Adam'].predict(X_test).flatten()})
deep_model_scores_df = score_model(chosen_arch_preds['8_1_Adam'], np.array(y_test), index='8_1_Adam')
# Add it to the leaderboard
optimizer_leaderboard_df = score_combine(pd.DataFrame(), deep_model_scores_df)
optimizer_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
CPU times: total: 15.6 ms
Wall time: 84.1 ms
Out[53]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
8_1_Adam 3.874307 1.433025 0.14167 0.139598

Nadam Optimizer¶

Nadam is Adam with Nesterov momentum. It should converge faster than Adam by incorporating a look-ahead feature.

In [54]:
%%time
all_models['8_1_Nadam'] = keras.models.clone_model(all_models['8_1'])

compile_options = common_compile_options()
compile_options['optimizer'] = keras.optimizers.Nadam()

all_models['8_1_Nadam'].compile(**compile_options)

deep_8_Nadam_checkpoint = keras.callbacks.ModelCheckpoint(
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_Nadam.weights.h5'),
    **common_checkpoint_options)

optimizer_histories['8_1_Nadam'] = \
    all_models['8_1_Nadam'].fit(
        **common_fit_options,
        callbacks=[deep_8_Nadam_checkpoint])

all_models['8_1_Nadam'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_Nadam.weights.h5'))

all_models['8_1_Nadam'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_13 (Dense)                │ (None, 8)              │            88 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_14 (Dense)                │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 315 (1.24 KB)
 Trainable params: 97 (388.00 B)
 Non-trainable params: 21 (88.00 B)
 Optimizer params: 197 (792.00 B)
CPU times: total: 2.81 s
Wall time: 8.07 s
Nadam Optimizer Training Loss Plot¶

It does seem to converge slightly faster than Adam.

In [55]:
%%time
plot_training_loss(optimizer_histories['8_1_Nadam'], '8-1 NN Model (Nadam)')
CPU times: total: 0 ns
Wall time: 8.01 ms
No description has been provided for this image
Nadam Optimizer Score¶
In [56]:
%%time
chosen_arch_preds.update({'8_1_Nadam': all_models['8_1_Nadam'].predict(X_test).flatten()})
deep_model_scores_df = score_model(chosen_arch_preds['8_1_Nadam'], np.array(y_test), index='8_1_Nadam')
# Add it to the leaderboard
optimizer_leaderboard_df = score_combine(optimizer_leaderboard_df, deep_model_scores_df)
optimizer_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
CPU times: total: 15.6 ms
Wall time: 83.6 ms
Out[56]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
8_1_Adam 3.874307 1.433025 0.141670 0.139598
8_1_Nadam 3.840628 1.450489 0.157823 0.156640

RMSprop Optimizer¶

RMSprop is a good choice for recurrent neural networks. It's similar to Adagrad, but it uses a moving average of the squared gradient for normalization.
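That moving average of the squared gradient is the whole trick. A simplified scalar sketch of one RMSprop step (`rho` matches Keras's default discounting factor; momentum and centering are omitted):

```python
def rmsprop_step(w, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-7):
    """One RMSprop update on a scalar weight: the step is the gradient
    normalized by the root of a decaying average of squared gradients."""
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2
    w -= lr * grad / (avg_sq ** 0.5 + eps)
    return w, avg_sq
```

Unlike Adagrad's ever-growing accumulator, the decaying average lets the effective step size recover when gradients shrink.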

In [57]:
%%time
all_models['8_1_RMSprop'] = keras.models.clone_model(all_models['8_1'])

compile_options = common_compile_options()
compile_options['optimizer'] = keras.optimizers.RMSprop()

all_models['8_1_RMSprop'].compile(**compile_options)

deep_8_RMSprop_checkpoint = keras.callbacks.ModelCheckpoint(
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_RMSprop.weights.h5'),
    **common_checkpoint_options)

optimizer_histories['8_1_RMSprop'] = \
    all_models['8_1_RMSprop'].fit(
        **common_fit_options,
        callbacks=[deep_8_RMSprop_checkpoint])

all_models['8_1_RMSprop'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_RMSprop.weights.h5'))

all_models['8_1_RMSprop'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_13 (Dense)                │ (None, 8)              │            88 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_14 (Dense)                │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 217 (876.00 B)
 Trainable params: 97 (388.00 B)
 Non-trainable params: 21 (88.00 B)
 Optimizer params: 99 (400.00 B)
CPU times: total: 1.69 s
Wall time: 8.1 s
RMSprop Optimizer Training Loss Plot¶
In [58]:
%%time
plot_training_loss(optimizer_histories['8_1_RMSprop'], '8-1 NN Model (RMSprop)')
CPU times: total: 15.6 ms
Wall time: 9.51 ms
No description has been provided for this image
RMSprop Optimizer Score¶
In [59]:
%%time
chosen_arch_preds.update({'8_1_RMSprop': all_models['8_1_RMSprop'].predict(X_test).flatten()})
deep_model_scores_df = score_model(chosen_arch_preds['8_1_RMSprop'], np.array(y_test), index='8_1_RMSprop')
# Add it to the leaderboard
optimizer_leaderboard_df = score_combine(optimizer_leaderboard_df, deep_model_scores_df)
optimizer_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
CPU times: total: 46.9 ms
Wall time: 92.6 ms
Out[59]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
8_1_Adam 3.874307 1.433025 0.141670 0.139598
8_1_Nadam 3.840628 1.450489 0.157823 0.156640
8_1_RMSprop 3.843666 1.453355 0.216134 0.215187

Stochastic Gradient Descent (SGD) Optimizer¶

SGD is the classic optimizer. It's a good choice for shallow networks or small datasets.

In [60]:
%%time
all_models['8_1_SGD'] = keras.models.clone_model(all_models['8_1'])

compile_options = common_compile_options()
compile_options['optimizer'] = keras.optimizers.SGD()

all_models['8_1_SGD'].compile(**compile_options)

deep_8_SGD_checkpoint = keras.callbacks.ModelCheckpoint(
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_SGD.weights.h5'),
    **common_checkpoint_options)

optimizer_histories['8_1_SGD'] = \
    all_models['8_1_SGD'].fit(
        **common_fit_options,
        callbacks=[deep_8_SGD_checkpoint])

all_models['8_1_SGD'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_SGD.weights.h5'))

all_models['8_1_SGD'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_13 (Dense)                │ (None, 8)              │            88 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_14 (Dense)                │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 120 (488.00 B)
 Trainable params: 97 (388.00 B)
 Non-trainable params: 21 (88.00 B)
 Optimizer params: 2 (12.00 B)
CPU times: total: 1.03 s
Wall time: 7.81 s
SGD Optimizer Training Loss Plot¶
In [61]:
%%time
plot_training_loss(optimizer_histories['8_1_SGD'], '8-1 NN Model (SGD)')
CPU times: total: 0 ns
Wall time: 7.52 ms
No description has been provided for this image
SGD Optimizer Score¶

That training loss is crazy. Hopefully the test scores are better.

In [62]:
%%time
chosen_arch_preds.update({'8_1_SGD': all_models['8_1_SGD'].predict(X_test).flatten()})
deep_model_scores_df = score_model(chosen_arch_preds['8_1_SGD'], np.array(y_test), index='8_1_SGD')
# Add it to the leaderboard
optimizer_leaderboard_df = score_combine(optimizer_leaderboard_df, deep_model_scores_df)
optimizer_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
CPU times: total: 15.6 ms
Wall time: 88.7 ms
Out[62]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
8_1_Adam 3.874307 1.433025 0.141670 0.139598
8_1_Nadam 3.840628 1.450489 0.157823 0.156640
8_1_RMSprop 3.843666 1.453355 0.216134 0.215187
8_1_SGD 4.126909 1.485601 -0.126507 -0.131760

Adagrad Optimizer¶

Adagrad is a good choice for sparse data. It adapts the learning rate based on the frequency of features.
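Adagrad's accumulator also explains the sluggish results we're about to see: it sums squared gradients forever, so the effective step size can only shrink. A simplified scalar sketch of one Adagrad step:

```python
def adagrad_step(w, grad, accum, lr=0.001, eps=1e-7):
    """One Adagrad update on a scalar weight. accum only ever grows,
    so the effective learning rate lr / sqrt(accum) only ever shrinks."""
    accum += grad ** 2
    w -= lr * grad / (accum ** 0.5 + eps)
    return w, accum
```

With a constant gradient the step sizes fall off as lr/√t, so at Keras's small default learning rate progress stalls quickly — consistent with how little Adagrad learns here.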

In [63]:
%%time
all_models['8_1_Adagrad'] = keras.models.clone_model(all_models['8_1'])

compile_options = common_compile_options()
compile_options['optimizer'] = keras.optimizers.Adagrad()

all_models['8_1_Adagrad'].compile(**compile_options)

deep_8_Adagrad_checkpoint = keras.callbacks.ModelCheckpoint(
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_Adagrad.weights.h5'),
    **common_checkpoint_options)

optimizer_histories['8_1_Adagrad'] = \
    all_models['8_1_Adagrad'].fit(
        **common_fit_options,
        callbacks=[deep_8_Adagrad_checkpoint])

all_models['8_1_Adagrad'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_Adagrad.weights.h5'))

all_models['8_1_Adagrad'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_13 (Dense)                │ (None, 8)              │            88 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_14 (Dense)                │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 217 (876.00 B)
 Trainable params: 97 (388.00 B)
 Non-trainable params: 21 (88.00 B)
 Optimizer params: 99 (400.00 B)
CPU times: total: 1.7 s
Wall time: 8.42 s
Adagrad Optimizer Training Loss Plot¶
In [64]:
%%time
plot_training_loss(optimizer_histories['8_1_Adagrad'], '8-1 NN Model (Adagrad)')
CPU times: total: 0 ns
Wall time: 7.51 ms
No description has been provided for this image
Adagrad Optimizer Score¶

Is it just me, or did Adagrad not learn anything yet?

In [65]:
%%time
chosen_arch_preds.update({'8_1_Adagrad': all_models['8_1_Adagrad'].predict(X_test).flatten()})
deep_model_scores_df = score_model(chosen_arch_preds['8_1_Adagrad'], np.array(y_test), index='8_1_Adagrad')
# Add it to the leaderboard
optimizer_leaderboard_df = score_combine(optimizer_leaderboard_df, deep_model_scores_df)
optimizer_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
CPU times: total: 0 ns
Wall time: 88.6 ms
Out[65]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
8_1_Adam 3.874307 1.433025 0.141670 0.139598
8_1_Nadam 3.840628 1.450489 0.157823 0.156640
8_1_RMSprop 3.843666 1.453355 0.216134 0.215187
8_1_SGD 4.126909 1.485601 -0.126507 -0.131760
8_1_Adagrad 14.886531 3.162995 0.271905 0.184537

Adadelta Optimizer¶

Adadelta is a good choice for large datasets.

Adadelta optimization is a stochastic gradient descent method that is based on adaptive learning rate per dimension to address two drawbacks:

  • The continual decay of learning rates throughout training.
  • The need for a manually selected global learning rate.
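The first drawback, the ever-shrinking learning rate, is easy to see in a toy one-dimensional sketch (not the notebook's code, just the arithmetic behind Adagrad's update):

```python
import math

# Toy 1-D sketch of Adagrad's continually decaying step size: the
# accumulator of squared gradients only ever grows, so the effective
# learning rate lr / sqrt(accum) only ever shrinks.
def adagrad_effective_lrs(lr=0.001, grads=(1.0, 1.0, 1.0, 1.0), eps=1e-7):
    accum, lrs = 0.0, []
    for g in grads:
        accum += g ** 2                     # accumulator never decreases
        lrs.append(lr / (math.sqrt(accum) + eps))
    return lrs

# every effective step is strictly smaller than the one before it
effective_lrs = adagrad_effective_lrs()
```

This is consistent with Adagrad's flat-looking training loss earlier: with the default base rate, its steps shrink quickly and it may simply need far more epochs.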

If its namesake is any indication, it might not do so well here. We'll see.

In [66]:
%%time
all_models['8_1_Adadelta'] = keras.models.clone_model(all_models['8_1'])

compile_options = common_compile_options()
compile_options['optimizer'] = keras.optimizers.Adadelta()

all_models['8_1_Adadelta'].compile(**compile_options)

deep_8_Adadelta_checkpoint = keras.callbacks.ModelCheckpoint(
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_Adadelta.weights.h5'),
    **common_checkpoint_options)

optimizer_histories['8_1_Adadelta'] = \
    all_models['8_1_Adadelta'].fit(
        **common_fit_options,
        callbacks=[deep_8_Adadelta_checkpoint])

all_models['8_1_Adadelta'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_Adadelta.weights.h5'))

all_models['8_1_Adadelta'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_13 (Dense)                │ (None, 8)              │            88 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_14 (Dense)                │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 314 (1.23 KB)
 Trainable params: 97 (388.00 B)
 Non-trainable params: 21 (88.00 B)
 Optimizer params: 196 (788.00 B)
CPU times: total: 1.14 s
Wall time: 8.23 s
Adadelta Optimizer Training Loss Plot¶
In [67]:
%%time
plot_training_loss(optimizer_histories['8_1_Adadelta'], '8-1 NN Model (Adadelta)')
CPU times: total: 0 ns
Wall time: 8.51 ms
Adadelta Optimizer Score¶

It's not looking good for Adadelta based on the training loss plot.

Maybe the scores will redeem it?

In [68]:
%%time
chosen_arch_preds.update({'8_1_Adadelta': all_models['8_1_Adadelta'].predict(X_test).flatten()})
deep_model_scores_df = score_model(chosen_arch_preds['8_1_Adadelta'], np.array(y_test), index='8_1_Adadelta')
# Add it to the leaderboard
optimizer_leaderboard_df = score_combine(optimizer_leaderboard_df, deep_model_scores_df)
optimizer_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
CPU times: total: 31.2 ms
Wall time: 89.6 ms
Out[68]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
8_1_Adam 3.874307 1.433025 0.141670 0.139598
8_1_Nadam 3.840628 1.450489 0.157823 0.156640
8_1_RMSprop 3.843666 1.453355 0.216134 0.215187
8_1_SGD 4.126909 1.485601 -0.126507 -0.131760
8_1_Adagrad 14.886531 3.162995 0.271905 0.184537
8_1_Adadelta 18.399206 3.530084 0.263624 0.094465

Adamax Optimizer¶

Adamax is a variant of Adam based on infinity norm.

It is suited for time-variant processes, so it might not be the best choice here. Let's try it anyway.
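As a toy sketch (not the notebook's code), the infinity-norm accumulator at the heart of Adamax looks like this:

```python
# Toy sketch of Adamax's infinity-norm accumulator: instead of Adam's
# decaying average of squared gradients, Adamax tracks
#   u_t = max(beta2 * u_{t-1}, |g_t|)
# so the step size adapts to the largest recent gradient magnitude.
def adamax_u(grads, beta2=0.999):
    u = 0.0
    for g in grads:
        u = max(beta2 * u, abs(g))
    return u
```

A single large gradient immediately dominates the accumulator, while small later gradients only let it decay slowly by the `beta2` factor.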

In [69]:
%%time
all_models['8_1_Adamax'] = keras.models.clone_model(all_models['8_1'])

compile_options = common_compile_options()
compile_options['optimizer'] = keras.optimizers.Adamax()

all_models['8_1_Adamax'].compile(**compile_options)

deep_8_Adamax_checkpoint = keras.callbacks.ModelCheckpoint(
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_Adamax.weights.h5'),
    **common_checkpoint_options)

optimizer_histories['8_1_Adamax'] = \
    all_models['8_1_Adamax'].fit(
        **common_fit_options,
        callbacks=[deep_8_Adamax_checkpoint])

all_models['8_1_Adamax'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_Adamax.weights.h5'))

all_models['8_1_Adamax'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_13 (Dense)                │ (None, 8)              │            88 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_14 (Dense)                │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 314 (1.23 KB)
 Trainable params: 97 (388.00 B)
 Non-trainable params: 21 (88.00 B)
 Optimizer params: 196 (788.00 B)
CPU times: total: 1.05 s
Wall time: 8.1 s
Adamax Optimizer Training Loss Plot¶
In [70]:
%%time
plot_training_loss(optimizer_histories['8_1_Adamax'], '8-1 NN Model (Adamax)')
CPU times: total: 0 ns
Wall time: 7.51 ms
Adamax Optimizer Score¶

It held up pretty well despite our initial doubts.

In [71]:
%%time
chosen_arch_preds.update({'8_1_Adamax': all_models['8_1_Adamax'].predict(X_test).flatten()})
deep_model_scores_df = score_model(chosen_arch_preds['8_1_Adamax'], np.array(y_test), index='8_1_Adamax')
# Add it to the leaderboard
optimizer_leaderboard_df = score_combine(optimizer_leaderboard_df, deep_model_scores_df)
optimizer_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
CPU times: total: 46.9 ms
Wall time: 86.1 ms
Out[71]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
8_1_Adam 3.874307 1.433025 0.141670 0.139598
8_1_Nadam 3.840628 1.450489 0.157823 0.156640
8_1_RMSprop 3.843666 1.453355 0.216134 0.215187
8_1_SGD 4.126909 1.485601 -0.126507 -0.131760
8_1_Adagrad 14.886531 3.162995 0.271905 0.184537
8_1_Adadelta 18.399206 3.530084 0.263624 0.094465
8_1_Adamax 3.936407 1.455304 0.117124 0.117120

Optimizer Decision¶

The results are in! Time to choose an optimizer.

We'll take the best optimizer and move on to tuning the next hyperparameter, learning rate.

Optimizer Training Loss Plots¶

Let's look at the big picture by looking at all the training loss plots.

In [72]:
%%time
plot_training_loss_from_dict(optimizer_histories)
CPU times: total: 15.6 ms
Wall time: 54.1 ms

Optimizer Leaderboard¶

Based on these training loss plots, I'm leaning towards Adam or Nadam.

Let's compare the scores on the leaderboard again before our final decision.

In [73]:
%%time
optimizer_leaderboard_df[:]
CPU times: total: 0 ns
Wall time: 0 ns
Out[73]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
8_1_Adam 3.874307 1.433025 0.141670 0.139598
8_1_Nadam 3.840628 1.450489 0.157823 0.156640
8_1_RMSprop 3.843666 1.453355 0.216134 0.215187
8_1_SGD 4.126909 1.485601 -0.126507 -0.131760
8_1_Adagrad 14.886531 3.162995 0.271905 0.184537
8_1_Adadelta 18.399206 3.530084 0.263624 0.094465
8_1_Adamax 3.936407 1.455304 0.117124 0.117120

And the Winner Is...¶

Nadam!¶

Nadam has the lowest mean squared error, and Adam beats it only narrowly on mean absolute error. A crab's age also has a wiggle room of a year or two because of how the data is collected, so that small gap doesn't worry us.

Let's tune the learning rate for Nadam next. We'll create a function with new compile options going forward.

In [74]:
%%time
def nadam_compile_options(learning_rate: float = 0.001, loss_metric='mean_squared_error'):
    """Wrapper for common_compile_options with Nadam optimizer.

    :param learning_rate: learning rate for Nadam optimizer
    :param loss_metric: loss metric for the model. Default is 'mean_squared_error'.
    """
    return common_compile_options(
        optimizer=keras.optimizers.Nadam(learning_rate=learning_rate),
        loss_metric=loss_metric
    )
CPU times: total: 0 ns
Wall time: 0 ns

Learning Rate Tuning¶

So far, we've been using a static learning rate of 0.001. Let's try a few different learning rates.

  • 0.1
  • 0.01
  • 0.001
  • 0.0001
  • Scheduled

Learning Rate = 0.1 (Fast Learning)¶

Let's try a fast learning rate to see if it helps.

In [75]:
%%time
# cloning from Nadam
all_models['8_1_LR_0_1'] = keras.models.clone_model(all_models['8_1_Nadam'])

all_models['8_1_LR_0_1'].compile(**nadam_compile_options(learning_rate=0.1))

deep_8_1_LR_0_1_checkpoint = keras.callbacks.ModelCheckpoint(
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_LR_0_1.weights.h5'),
    **common_checkpoint_options
)

# initialize history dictionary
learning_rate_histories = {
    '8_1_LR_0_1': all_models['8_1_LR_0_1'].fit(
        **common_fit_options,
        callbacks=[deep_8_1_LR_0_1_checkpoint])}

all_models['8_1_LR_0_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_LR_0_1.weights.h5'))

all_models['8_1_LR_0_1'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_13 (Dense)                │ (None, 8)              │            88 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_14 (Dense)                │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 315 (1.24 KB)
 Trainable params: 97 (388.00 B)
 Non-trainable params: 21 (88.00 B)
 Optimizer params: 197 (792.00 B)
CPU times: total: 1.03 s
Wall time: 8.12 s
Learning Rate = 0.1 Training Loss Plot¶

We're expecting a quick approximation and a lot of variance.

In [76]:
%%time
plot_training_loss(learning_rate_histories['8_1_LR_0_1'], '8-1 NN Model (LR=0.1)')
CPU times: total: 0 ns
Wall time: 8.51 ms
Learning Rate = 0.1 Score¶

Yikes! That's a lot of variance. Let's try a slower learning rate next. But first, let's put it on the leaderboard.

In [77]:
%%time
chosen_arch_preds = {'8_1_LR_0_1': all_models['8_1_LR_0_1'].predict(X_test).flatten()}
deep_model_scores_df = score_model(chosen_arch_preds['8_1_LR_0_1'], np.array(y_test), index='8_1_LR_0_1')
# Add it to the leaderboard
learning_rate_leaderboard_df = score_combine(pd.DataFrame(), deep_model_scores_df)
learning_rate_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
CPU times: total: 46.9 ms
Wall time: 98.4 ms
Out[77]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
8_1_LR_0_1 3.761772 1.446253 0.183375 0.17987

Learning Rate = 0.01 (Less Fast Learning)¶

Still not "slow" learning, but let's decelerate a bit to see if we can address the variance.

In [78]:
%%time
all_models['8_1_LR_0_01'] = keras.models.clone_model(all_models['8_1_Nadam'])

all_models['8_1_LR_0_01'].compile(**nadam_compile_options(learning_rate=0.01))

deep_8_1_LR_0_01_checkpoint = keras.callbacks.ModelCheckpoint(
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_LR_0_01.weights.h5'),
    **common_checkpoint_options)

learning_rate_histories['8_1_LR_0_01'] = \
    all_models['8_1_LR_0_01'].fit(
        **common_fit_options,
        callbacks=[deep_8_1_LR_0_01_checkpoint])

all_models['8_1_LR_0_01'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_LR_0_01.weights.h5'))

all_models['8_1_LR_0_01'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_13 (Dense)                │ (None, 8)              │            88 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_14 (Dense)                │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 315 (1.24 KB)
 Trainable params: 97 (388.00 B)
 Non-trainable params: 21 (88.00 B)
 Optimizer params: 197 (792.00 B)
CPU times: total: 859 ms
Wall time: 8.03 s
Learning Rate = 0.01 Training Loss Plot¶

We're looking for a smoother curve with less variance.

In [79]:
%%time
plot_training_loss(learning_rate_histories['8_1_LR_0_01'], '8-1 NN Model (LR=0.01)')
CPU times: total: 0 ns
Wall time: 9.51 ms
Learning Rate = 0.01 Score¶

That's more like it. Let's put it on the leaderboard.

In [80]:
%%time
chosen_arch_preds = {'8_1_LR_0_01': all_models['8_1_LR_0_01'].predict(X_test).flatten()}
deep_model_scores_df = score_model(chosen_arch_preds['8_1_LR_0_01'], np.array(y_test), index='8_1_LR_0_01')
# Add it to the leaderboard
learning_rate_leaderboard_df = score_combine(learning_rate_leaderboard_df, deep_model_scores_df)
learning_rate_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
CPU times: total: 46.9 ms
Wall time: 105 ms
Out[80]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
8_1_LR_0_1 3.761772 1.446253 0.183375 0.179870
8_1_LR_0_01 3.712384 1.408267 0.238712 0.237596

Learning Rate = 0.001 (Slow Learning)¶

This is the learning rate we've been using, so we know what to expect.

Let's confirm our expectations and see how it compares to the others.

In [81]:
%%time
all_models['8_1_LR_0_001'] = keras.models.clone_model(all_models['8_1_Nadam'])

all_models['8_1_LR_0_001'].compile(**nadam_compile_options(learning_rate=0.001))

deep_8_1_LR_0_001_checkpoint = keras.callbacks.ModelCheckpoint(
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_LR_0_001.weights.h5'),
    **common_checkpoint_options)

learning_rate_histories['8_1_LR_0_001'] = \
    all_models['8_1_LR_0_001'].fit(
        **common_fit_options,
        callbacks=[deep_8_1_LR_0_001_checkpoint])

all_models['8_1_LR_0_001'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_LR_0_001.weights.h5'))

all_models['8_1_LR_0_001'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_13 (Dense)                │ (None, 8)              │            88 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_14 (Dense)                │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 315 (1.24 KB)
 Trainable params: 97 (388.00 B)
 Non-trainable params: 21 (88.00 B)
 Optimizer params: 197 (792.00 B)
CPU times: total: 1.47 s
Wall time: 8.34 s
Learning Rate = 0.001 Training Loss Plot¶

This should look familiar.

In [82]:
%%time
plot_training_loss(learning_rate_histories['8_1_LR_0_001'], '8-1 NN Model (LR=0.001)')
CPU times: total: 0 ns
Wall time: 7.5 ms
Learning Rate = 0.001 Score¶

Add it to the leaderboard.

In [83]:
%%time
chosen_arch_preds = {'8_1_LR_0_001': all_models['8_1_LR_0_001'].predict(X_test).flatten()}
deep_model_scores_df = score_model(chosen_arch_preds['8_1_LR_0_001'], np.array(y_test), index='8_1_LR_0_001')
# Add it to the leaderboard
learning_rate_leaderboard_df = score_combine(learning_rate_leaderboard_df, deep_model_scores_df)
learning_rate_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
CPU times: total: 78.1 ms
Wall time: 93.6 ms
Out[83]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
8_1_LR_0_1 3.761772 1.446253 0.183375 0.179870
8_1_LR_0_01 3.712384 1.408267 0.238712 0.237596
8_1_LR_0_001 4.019154 1.480873 0.136125 0.136110

Learning Rate = 0.0001 (Slower Learning)¶

Let's slow down the learning rate even more. It might take a while to converge, but we expect less variance.

In [84]:
%%time
all_models['8_1_LR_0_0001'] = keras.models.clone_model(all_models['8_1_Nadam'])

all_models['8_1_LR_0_0001'].compile(**nadam_compile_options(learning_rate=0.0001))

deep_8_1_LR_0_0001_checkpoint = keras.callbacks.ModelCheckpoint(
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_LR_0_0001.weights.h5'),
    **common_checkpoint_options)

learning_rate_histories['8_1_LR_0_0001'] = \
    all_models['8_1_LR_0_0001'].fit(
        **common_fit_options,
        callbacks=[deep_8_1_LR_0_0001_checkpoint])

all_models['8_1_LR_0_0001'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_LR_0_0001.weights.h5'))

all_models['8_1_LR_0_0001'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_13 (Dense)                │ (None, 8)              │            88 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_14 (Dense)                │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 315 (1.24 KB)
 Trainable params: 97 (388.00 B)
 Non-trainable params: 21 (88.00 B)
 Optimizer params: 197 (792.00 B)
CPU times: total: 2.06 s
Wall time: 9.04 s
Learning Rate = 0.0001 Training Loss Plot¶

This should be a slow and steady curve.

In [85]:
%%time
plot_training_loss(learning_rate_histories['8_1_LR_0_0001'], '8-1 NN Model (LR=0.0001)')
CPU times: total: 0 ns
Wall time: 8.51 ms
Learning Rate = 0.0001 Score¶

This one is acting as expected. In an ideal world, we would give every model more epochs, but for the sake of time, we'll stick to 100 epochs and consider this 'too slow' for this project.

In [86]:
%%time
chosen_arch_preds = {'8_1_LR_0_0001': all_models['8_1_LR_0_0001'].predict(X_test).flatten()}
deep_model_scores_df = score_model(chosen_arch_preds['8_1_LR_0_0001'], np.array(y_test), index='8_1_LR_0_0001')
# Add it to the leaderboard
learning_rate_leaderboard_df = score_combine(learning_rate_leaderboard_df, deep_model_scores_df)
learning_rate_leaderboard_df.sort_index()[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
CPU times: total: 15.6 ms
Wall time: 92.8 ms
Out[86]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
8_1_LR_0_0001 5.629026 1.729137 0.024360 0.004204
8_1_LR_0_001 4.019154 1.480873 0.136125 0.136110
8_1_LR_0_01 3.712384 1.408267 0.238712 0.237596
8_1_LR_0_1 3.761772 1.446253 0.183375 0.179870

Scheduled Learning Rate¶

Let's use what we learned from simulated annealing to schedule the learning rate.

The fixed 0.01 learning rate has the best scores so far.

Our scheduled learning rate can start here and decrease by $X$% every $Y$ epochs of no improvement.

We learned from an earlier experiment that these networks commonly plateau but continue to learn after a while, so we want to give it a chance to learn.

We'll use a ReduceLROnPlateau callback to adjust the learning rate based on the validation loss.

  • Factor = 0.75: The factor by which the learning rate will be reduced. new_lr = lr * factor.
  • Patience = 9: Number of epochs with no improvement after which learning rate will be reduced.

These values were chosen based on some experimentation (not shown here for brevity).

I wonder if we can schedule the schedule's schedule... (We can, but we won't here.)

In [87]:
%%time
all_models['8_1_LR_S'] = keras.models.clone_model(all_models['8_1_Nadam'])

all_models['8_1_LR_S'].compile(**nadam_compile_options(learning_rate=0.01))

deep_8_1_LR_S_checkpoint = keras.callbacks.ModelCheckpoint(
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_LR_S.weights.h5'),
    **common_checkpoint_options
)

learning_rate_schedule = keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.75,
    patience=9,
    verbose=1,
    mode='min'
)

learning_rate_histories['8_1_LR_S'] = \
    all_models['8_1_LR_S'].fit(
        **common_fit_options,
        callbacks=[deep_8_1_LR_S_checkpoint, learning_rate_schedule])

all_models['8_1_LR_S'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_LR_S.weights.h5'))

all_models['8_1_LR_S'].summary()
Epoch 75: ReduceLROnPlateau reducing learning rate to 0.007499999832361937.

Epoch 89: ReduceLROnPlateau reducing learning rate to 0.005624999874271452.
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_13 (Dense)                │ (None, 8)              │            88 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_14 (Dense)                │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 315 (1.24 KB)
 Trainable params: 97 (388.00 B)
 Non-trainable params: 21 (88.00 B)
 Optimizer params: 197 (792.00 B)
CPU times: total: 2.08 s
Wall time: 7.96 s
Learning Rate Schedule Training Loss Plot¶

Let's look for an improvement in the training loss rate over epochs.

In [88]:
%%time
plot_training_loss(learning_rate_histories['8_1_LR_S'], '8-1 NN Model (LR=Scheduled)')
CPU times: total: 0 ns
Wall time: 7.51 ms
Scheduled Learning Rate Score¶
In [89]:
%%time
chosen_arch_preds = {'8_1_LR_S': all_models['8_1_LR_S'].predict(X_test).flatten()}
deep_model_scores_df = score_model(chosen_arch_preds['8_1_LR_S'], np.array(y_test), index='8_1_LR_S')
# Add it to the leaderboard
learning_rate_leaderboard_df = score_combine(learning_rate_leaderboard_df, deep_model_scores_df)
learning_rate_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
CPU times: total: 15.6 ms
Wall time: 85.1 ms
Out[89]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
8_1_LR_0_1 3.761772 1.446253 0.183375 0.179870
8_1_LR_0_01 3.712384 1.408267 0.238712 0.237596
8_1_LR_0_001 4.019154 1.480873 0.136125 0.136110
8_1_LR_0_0001 5.629026 1.729137 0.024360 0.004204
8_1_LR_S 3.750417 1.413749 0.159442 0.158437

The scheduled learning rate's error stats land just behind the fixed 0.01 rate. But not so fast, let's take a look at the big picture.

Learning Rate Decision¶

Let's compare the training loss plots and the leaderboard scores for all the learning rates.

Reminder of our criteria:

  • Mean Absolute Error within 2 years.
  • Reasonable Explained Variance Score
  • Reasonable R2 Score
  • Avoid Overfitting
  • Reasonable Learning Rate

Learning Rate Training Loss Plots¶

In [90]:
%%time
plot_training_loss_from_dict(learning_rate_histories)
CPU times: total: 15.6 ms
Wall time: 37.5 ms

Learning Rate Leaderboard¶

Check the leaderboard again before the big decision.

In [91]:
%%time
learning_rate_leaderboard_df[:]
CPU times: total: 0 ns
Wall time: 0 ns
Out[91]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
8_1_LR_0_1 3.761772 1.446253 0.183375 0.179870
8_1_LR_0_01 3.712384 1.408267 0.238712 0.237596
8_1_LR_0_001 4.019154 1.480873 0.136125 0.136110
8_1_LR_0_0001 5.629026 1.729137 0.024360 0.004204
8_1_LR_S 3.750417 1.413749 0.159442 0.158437

And the Winner Is...¶

Scheduled Learning Rate!¶

Spending a little extra time on the learning rate paid off. Its error scores sit within a whisker of the fixed 0.01 rate, its variance is acceptable, and unlike a fixed rate it can keep adapting if we train longer.

Specifically, we are using ReduceLROnPlateau for our schedule.

  • Factor = 0.75: The factor by which the learning rate will be reduced. new_lr = lr * factor.
  • Patience = 9: Number of epochs with no improvement after which learning rate will be reduced.

Others exist (like ExponentialDecay), but this one worked for us this time.
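As toy arithmetic (not the notebook's code), the two schedule styles differ like this: one reacts to plateaus, the other shrinks on a fixed clock.

```python
# ReduceLROnPlateau multiplies the rate by `factor` each time `patience`
# epochs pass without improvement.
def plateau_lr(lr, factor=0.75, plateaus=1):
    for _ in range(plateaus):
        lr *= factor
    return lr

# ExponentialDecay (continuous form) shrinks the rate on a fixed schedule,
# regardless of validation loss:
#   lr(step) = initial_lr * decay_rate ** (step / decay_steps)
def exponential_decay_lr(initial_lr, decay_rate, decay_steps, step):
    return initial_lr * decay_rate ** (step / decay_steps)
```

With `lr = 0.01` and `factor = 0.75`, two plateau reductions give 0.0075 and then 0.005625, exactly the values ReduceLROnPlateau logged during training above.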

So far we have chosen the (8-1) neural network architecture with the Nadam optimizer and a scheduled learning rate.

Loss Function to Mean Absolute Error¶

Let's try a different loss function to see if it improves the model.

  • Loss Function
    • Mean Absolute Error (MAE)
      • Less sensitive to outliers.
      • Penalizes all errors equally.

This could be good for our model, as we removed outliers in the data cleaning step. Let's find out.

We'll keep the best architecture so far and change the loss function to MAE.
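A toy illustration (made-up numbers, not our crab data) of why MAE is gentler on outliers than MSE: one large error dominates the squared average but only contributes linearly to the absolute average.

```python
import numpy as np

# four residuals, one of them outlier-sized
errors = np.array([1.0, 1.0, 1.0, 8.0])

mae = np.abs(errors).mean()   # (1 + 1 + 1 + 8) / 4  = 2.75
mse = (errors ** 2).mean()    # (1 + 1 + 1 + 64) / 4 = 16.75
```

Since we already removed outliers during data cleaning, this advantage may not buy us much here, which is exactly what we're about to test.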

In [92]:
%%time
all_models['8_1_MAE'] = keras.models.clone_model(all_models['8_1_LR_S'])

all_models['8_1_MAE'].compile(**nadam_compile_options(
    learning_rate=0.01,
    loss_metric='mean_absolute_error'))

deep_8_1_MAE_checkpoint = keras.callbacks.ModelCheckpoint(
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_MAE.weights.h5'),
    **common_checkpoint_options
)

loss_histories = {'8_1_MAE': 
    all_models['8_1_MAE'].fit(
        **common_fit_options,
        callbacks=[deep_8_1_MAE_checkpoint, learning_rate_schedule])}

all_models['8_1_MAE'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_MAE.weights.h5'))

all_models['8_1_MAE'].summary()
Epoch 57: ReduceLROnPlateau reducing learning rate to 0.007499999832361937.

Epoch 69: ReduceLROnPlateau reducing learning rate to 0.005624999874271452.

Epoch 86: ReduceLROnPlateau reducing learning rate to 0.004218749818392098.

Epoch 95: ReduceLROnPlateau reducing learning rate to 0.003164062276482582.
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_13 (Dense)                │ (None, 8)              │            88 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_14 (Dense)                │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 315 (1.24 KB)
 Trainable params: 97 (388.00 B)
 Non-trainable params: 21 (88.00 B)
 Optimizer params: 197 (792.00 B)
CPU times: total: 1.72 s
Wall time: 8.06 s

Loss Function = Mean Absolute Error Training Loss Plot¶

Note: The loss function is different, so the scale will be different.

In [93]:
%%time
plot_training_loss(loss_histories['8_1_MAE'], '8-1 NN Model (MAE)', y_lim=(0, 5))
CPU times: total: 0 ns
Wall time: 8.52 ms

Loss Function = Mean Absolute Error Score¶

We can't tell anything yet since it's a new scale. Let's check out the leaderboard with all the metrics.

In [94]:
%%time
chosen_arch_preds['8_1_MAE'] = all_models['8_1_MAE'].predict(X_test).flatten()
deep_model_scores_df = score_model(chosen_arch_preds['8_1_MAE'], np.array(y_test), index='8_1_MAE')
# Add it to the leaderboard
loss_leaderboard_df = score_combine(learning_rate_leaderboard_df.loc[['8_1_LR_S']], deep_model_scores_df)
loss_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
CPU times: total: 31.2 ms
Wall time: 91.2 ms
Out[94]:
mean_squared_error mean_absolute_error explained_variance_score r2_score
8_1_LR_S 3.750417 1.413749 0.159442 0.158437
8_1_MAE 3.897357 1.405759 0.208332 0.201616

Mean Absolute Error Loss Function Observations¶

A mixed result. The MAE-trained model edges ahead on mean absolute error, explained variance, and R², but falls behind on mean squared error, the headline metric we've tracked all along. We'll stay with MSE loss.

Perhaps an Ensemble Will Help¶

But I'm running out of time. Let's move on to feature engineering with our best model so far.
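For the record, the simplest ensemble we're skipping would just average the flattened test-set predictions of several trained models. A minimal sketch, where `preds_by_name` stands in for a dict like `chosen_arch_preds` above:

```python
import numpy as np

# Average the per-sample predictions of the named models. Each entry in
# `preds_by_name` is a 1-D array of predictions over the same test set.
def ensemble_mean(preds_by_name, names):
    return np.mean([preds_by_name[n] for n in names], axis=0)
```

Averaging tends to cancel out the uncorrelated errors of the individual models, at the cost of training and keeping several of them.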

Winner, Winner, Crab's for Dinner!¶

Reminder of our criteria:

  • Mean Absolute Error within 2 years.
  • Reasonable Explained Variance Score
  • Reasonable R2 Score
  • Avoid Overfitting
  • Reasonable Learning Rate

Our Best Model So Far¶

  • Architecture: (8-1) Neural Network
  • Optimizer: Nadam
  • Learning Rate: Scheduled
    • Start = 0.01
    • Factor = 0.75
    • Patience = 9 epochs
  • Loss Function: Mean Squared Error

This model should be quick to train to an acceptable level.

No Need to Save the Data¶

We didn't make any changes to the data, so we can pick this back up on the next step.

Onwards to Feature Engineering¶

See the next section for feature engineering.
